Confusion modelling for automated lip-reading using weighted finite-state transducers
Authors
Abstract
Automated lip-reading involves recognising speech from only the visual signal. The accuracy of current state-of-the-art lip-reading systems is significantly lower than that obtained by acoustic speech recognisers. These poor results are most likely due to the lack of information about speech production that is available in the visual signal: for example, it is impossible to discriminate voiced and unvoiced sounds, or many places of articulation, from visual signals. Our approach to this problem is to regard the visual speech signal as having been produced by a speaker who has a reduced phonemic repertoire and to attempt to compensate for this. In this respect, visual speech is similar to dysarthric speech, which is produced by a speaker who has poor control over their articulators, leading them to speak with a reduced and distorted set of phonemes. In previous work, we found that the use of weighted finite-state transducers improved recognition performance on dysarthric speech considerably. In this paper, we report the results of applying this technique to lip-reading. The technique works, but our initial results are not as good as those obtained by using a conventional approach, and we discuss why this might be so and what the prospects for future investigation are.
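As a rough illustration of the idea (not the authors' implementation), a confusion model can be thought of as a weighted mapping from intended phonemes to the coarser units that are actually distinguishable on the lips, which is then combined with a pronunciation lexicon to rescore hypotheses. The toy Python sketch below uses plain dictionaries in place of a real WFST toolkit; every symbol, probability, lexicon entry and function name in it is an invented assumption for illustration only.

```python
# Toy sketch of confusion modelling by weighted transduction.
# All symbols, probabilities and the tiny lexicon are illustrative
# assumptions, not values taken from the paper.
import math

# Confusion model: P(observed visual unit | intended phoneme).
# Visually similar phonemes (e.g. /p/, /b/, /m/) collapse onto one unit.
confusion = {
    "p": {"P": 0.9, "F": 0.1},
    "b": {"P": 0.9, "F": 0.1},
    "m": {"P": 1.0},
    "f": {"F": 0.8, "P": 0.2},
    "ae": {"A": 1.0},
    "ih": {"I": 1.0},
    "t": {"T": 1.0},
}

# Tiny pronunciation lexicon: word -> phoneme sequence.
lexicon = {
    "pat": ["p", "ae", "t"],
    "bat": ["b", "ae", "t"],
    "mat": ["m", "ae", "t"],
    "fat": ["f", "ae", "t"],
    "bit": ["b", "ih", "t"],
}

def score(word_phones, observed):
    """Negative log-probability of the observed visual-unit sequence
    given the word's phoneme sequence (per-symbol independence assumed)."""
    if len(word_phones) != len(observed):
        return math.inf
    cost = 0.0
    for ph, obs in zip(word_phones, observed):
        p = confusion.get(ph, {}).get(obs, 0.0)
        if p == 0.0:
            return math.inf
        cost += -math.log(p)
    return cost

observed = ["P", "A", "T"]  # what the visual front end reports
for word in sorted(lexicon, key=lambda w: score(lexicon[w], observed)):
    print(word, score(lexicon[word], observed))
# "mat", "pat" and "bat" come out almost indistinguishable: the confusion
# model makes the visual ambiguity explicit, and a language model would
# be needed to choose between them.
```

In the WFST formulation, this mapping would instead be encoded as a confusion transducer and composed with lexicon and language-model transducers, so that the search over hypotheses is carried out by standard shortest-path operations.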
Similar papers
To build a model for implementing automated lip reading which involves Lip motion feature to text conversion
A speech recognition system has three major components: feature extraction, probabilistic modelling of features and classification. In the literature, the general approach is to extract the principal components of the lip movement in terms of lip-shape-based properties in order to establish a one-to-one correspondence between phonemes of speech and visemes of lip shape. Several modelling and cl...
Improving visual features for lip-reading
Automatic speech recognition systems that utilise the visual modality of speech often are investigated within a speaker-dependent or a multi-speaker paradigm. That is, during training the recogniser will have had prior exposure to example speech from each of the possible test speakers. In a previous paper we highlighted the danger of not using different speakers in the training and test sets, an...
Limitations of visual speech recognition
In this paper we investigate the limits of automated lip-reading systems and we consider the improvement that could be gained were additional information from other (non-visible) speech articulators available to the recogniser. Hidden Markov model (HMM) speech recognisers are trained using electromagnetic articulography (EMA) data drawn from the MOCHA-TIMIT data set. Articulatory information is...
Silence models in weighted finite-state transducers
We investigate the effects of different silence modelling strategies in Weighted Finite-State Transducers for Automatic Speech Recognition. We show that the choice of silence models, and the way they are included in the transducer, can have a significant effect on the size of the resulting transducer; we present a means to prevent particularly large silence overheads. Our conclusions include th...
Bimachines and Structurally-Reversed Automata
Although bimachines are not widely used in practice, they represent a central concept in the study of rational functions. Indeed, they are finite state machines specifically designed to implement rational word functions. Their modelling power is equal to that of single-valued finite transducers. From the theoretical point of view, bimachines reflect the decomposition of a rational function into...